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FINITE SAMPLE APPROXIMATION RESULTS 
FOR PRINCIPAL COMPONENT ANALYSIS: 
A MATRIX PERTURBATION APPROACH 1 

By Boaz Nadler 

Weizmann Institute of Science 

Principal component analysis (PCA) is a standard tool for di- 
mensional reduction of a set of n observations (samples), each with 
p variables. In this paper, using a matrix perturbation approach, we 
study the nonasymptotic relation between the eigenvalues and eigen- 
vectors of PCA computed on a finite sample of size n, and those of 
the limiting population PCA as n — > oo. As in machine learning, we 
present a finite sample theorem which holds with high probability 
for the closeness between the leading eigenvalue and eigenvector of 
sample PCA and population PCA under a spiked covariance model. 
In addition, we also consider the relation between finite sample PCA 
and the asymptotic results in the joint limit p, n — > oo, with p/n — c. 
We present a matrix perturbation view of the "phase transition phe- 
nomenon," and a simple linear-algebra based derivation of the eigen- 
value and eigenvector overlap in this asymptotic limit. Moreover, our 
analysis also applies for finite p, n where we show that although there 
is no sharp phase transition as in the infinite case, either as a func- 
tion of noise level or as a function of sample size n, the eigenvector of 
sample PCA may exhibit a sharp "loss of tracking," suddenly losing 
its relation to the (true) eigenvector of the population PCA matrix. 
This occurs due to a crossover between the eigenvalue due to the sig- 
nal and the largest eigenvalue due to noise, whose eigenvector points 
in a random direction. 

1. Introduction. Principal component analysis (PCA) is a standard tool 
for dimensionality reduction, applied in regression, classification and many 
other data analysis tasks in a variety of scientific fields [17, 20]. PCA finds 
orthonormal directions with maximal variance of the data and allows its 
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low-dimensional representation by linear projections onto these directions. 
This dimensionality reduction is a typical preprocessing step in many clas- 
sification and regression problems. 

Assuming the given data is a finite and random sample from a (generally 
unknown) distribution, an interesting theoretical and practical question is 
the relation between the sample PC A results computed from finite data and 
those of the underlying population model. In this paper we consider a spiked 
covariance model for which the underlying data is low-dimensional but each 
sample is corrupted by additive Gaussian noise, and analyze the following 
question: how close are the directions and eigenvalues computed by sample 
PC A to the limiting (and generally unknown) directions and eigenvalues of 
the population model and how do these quantities depend on the number of 
samples n, the dimensionality p and the noise level a. 

Many works have studied the convergence of eigenvalues and eigenvec- 
tors of sample PCA to those of population PCA under various settings. In 
general, the different results regarding convergence can be divided into two 
main types: (i) asymptotic results of classical statistics, where p is fixed and 
n — > oo, starting with the classical works of Girshick [12], Lawley [23] and 
Anderson [1], who assumed multivariate Gaussian distributions, up to more 
recent work which relaxed some of these assumptions; see [2, 17] and ref- 
erences therein; (ii) modern "large p, large n" statistical results, where the 
joint limit p, n — > oo is considered while the ratio p/n = c is kept fixed. In the 
statistical physics literature we mention Hoyle and Rattray [14], Reimann, 
Van den Broeck and Bex [32] , Watkin and Nadal [37] , Biehl and Mietzner [5] 
and references therein, whereas in the statistics community see the recent 
works by Johnstone [18], Baik and Silverstein [4], Debashis [31], Onatski 
[29] and many references therein for older works. However, it seems that 
the most relevant case, that of explicit approximation bounds between the 
eigenvalues and eigenvectors computed in a finite setting (p,n finite) and 
those of the infinite setting {p fixed, n = oo), as well as estimates of the dis- 
tributions of these quantities (again for finite and fixed p, n), are not covered 
by these approaches. 

In the present work we address this problem using a combination of ma- 
trix perturbation theory and concentration of measure bounds on the norm 
of noisy Wishart matrices. This paper contains two main contributions. First 
we present probabilistic approximation results regarding the difference be- 
tween the leading eigenvalue and eigenvector of sample PCA and population 
PCA for fixed p and n, under a spiked covariance model with a single com- 
ponent. The second contribution of this paper is a matrix perturbation view 
of the phase transition for the leading eigenvalue and eigenvector, when 
both p, n are large. We present a simple linear-algebra based proof of the 
asymptotics of the leading eigenvalue and eigenvector in the limit p, n — > oo. 
Second, we present an interesting connection between this asymptotic value 
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and a classical result by Lawley. This observation leads to two additional 
propositions, one regarding the limiting eigenvalues for a more general spiked 
covariance model, and one regarding the spectral norm of a noisy Wishart 
matrix with nonidentity diagonal covariance matrix. These results may be 
useful for inference on the number of significant components under more 
general settings of heteroscedastic correlated noise. Finally, for finite p, n we 
show that while there is no deterministic phase transition at a specific fixed 
value of p/n, either as a function of noise level or as a function of sample 
size, the eigenvector of sample PCA may exhibit a sharp "loss of tracking," 
suddenly losing its relation to the (true) eigenvector of the population PCA 
matrix. This occurs due to a crossover between the eigenvalue due to the 
signal and the largest eigenvalue due to noise, whose eigenvector points in a 
random direction. 

Matrix perturbation theory, including eigenvalue and eigenvector pertur- 
bation bounds, as well as the structure of eigenvalues and eigenvectors of 
arrowhead matrices, play a key role in the analysis of both finite sample 
PCA and the asymptotic limit p,n— > oo. Perturbation theory and concen- 
tration of measure results on the norm of noisy Wishart matrices are key 
ingredients in the analysis of the finite sample case, and also provide novel 
insight into the origins of the phase transition in the joint limit p, n — > oo. 
In the statistics literature, matrix perturbation theory is typically used to 
bound remainder terms in asymptotic limits. In the context of PCA, for 
example, in [10] Eaton and Tyler used a perturbation bound by Wielandt to 
present a simple derivation of asymptotic results as n — > oo for eigenvalues of 
random symmetric matrices, but did not consider the nonasymptotic case. 
In [35] , Stewart introduced a general framework of stochastic perturbation 
theory to analyze the effects of random perturbations on the eigenvalues of 
finite matrices, whereas perturbation bounds for the singular value decom- 
position were considered in [36] . Within the context of the spiked covariance 
model, matrix perturbation theory was used in [4, 19, 31], though these 
works considered mainly the asymptotic limit p, n — ► oo. 

From a practical point of view, our results show that when p,n are large, 
and specifically when p S> n, the true signal directions may be drowned 
by noise since for finite p>n, eigenvector reconstruction errors behave 
as (jy/p/n, as also predicted by the asymptotic analysis of Johnstone and 
Lu [19]. A similar phenomenon also occurs in linear discriminant analy- 
sis [6], and for various multivariate regression algorithms such as partial 
least squares and classical least squares [26]. All these works emphasize the 
importance of regularization, feature selection or low-dimensional represen- 
tation prior to learning, and hint that global methods may not be optimal 
for dimensional reduction or as a preprocessing step prior to regression and 
classification of high-dimensional noisy data, specifically if there is a priori 
knowledge regarding its sparsity or smoothness. The results and approach 
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presented in this paper can also be used to develop methods to determine 
the number of components in a linear mixture (spiked covariance) model 



The paper is organized as follows. In Section 2 we present the spiked 
covariance model and our main results. The results for finite p, n are proven 
in Sections 3 and 4. A matrix perturbation view of the phase transition 
phenomenon in the joint limit as p, n — > oo, as well an analysis of the spiked 
covariance model under more general models of noise and some finite sample 
examples appear in Section 5. 

2. Model, assumptions and main results. 

2.1. Notation. Univariate random variables are denoted by lowercase let- 
ters, as in u, their realizations are denoted by u v and their expectation by 
E{w}. Column vectors are denoted by boldface lowercase letters, as in x, 
their transpose is x r , their jth component is Xj the dot product between 
two vectors is (x,z), and the Euclidean (L2) norm of x is ||x||. Matrices are 
denoted by uppercase letters, as in A. The identity matrix of order p is I p , 
and the spectral norm of a matrix A is \\A\\. 

2.2. The spiked covariance model. We consider a spiked covariance model 
where each data sample x has the form 

k 



where {u{}^ =1 are random variables, typically called components or latent 
variables, {v.;}^ =1 C W are the corresponding fixed (and typically unknown) 
response vectors, £ is a multivariate Gaussian noise vector with identity co- 
variance matrix and a is the level of noise. Equation (2.1) is an error-in- 
variables linear mixture model, commonly assumed in many different prob- 
lems and applications, including independent component analysis (ICA) [15], 
signal processing, and in the analysis of spectroscopic data, where it is known 
as Beer's law [25, 27]. 

We denote by S the px p population covariance matrix corresponding to 
the observations x, 



and by S n the sample covariance matrix corresponding to the n i.i.d. obser- 
vations {x"}" =1 , 



[22]. 






S = E{xx T } 



(2.3) 
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Assuming that all k vectors Vj are orthogonal and that all k random vari- 
ables in (2.1) are uncorrelated with zero mean, unit variance and finite 
fourth moment, it follows that the largest k eigenvalue/eigenvector pairs of 
I] are given by {(||vj || 2 + a 2 , Vj)}j =1 . PC A approximates the eigenvalues 
and eigenvectors of the unknown S by those of S n . In particular, the top 
eigenvectors of S n correspond to orthogonal directions of maximal variance 
of the observed data. The question at hand, then, is how close are the largest 
eigenvalues and corresponding eigenvectors of S n to those of S. 

In this paper we present a matrix perturbation view of this problem. 
For simplicity we consider the case of a single component {k = 1). A simi- 
lar though more complicated analysis can be carried out for the case of k 
components. We thus consider n samples from the model 



(2.4) 



x = uv + £r£, 



where we assume that the random variable u has zero mean, unit variance 
and finite fourth moment. Without loss of generality, we further assume that 
the vector v is in the direction of ei = (1, 0, . . . , 0), for example, v = ||v||ei. 
Finally, since u has zero mean, we neglect in our analysis the initial mean 
centering step typically done in PCA. 

The population covariance matrix corresponding to this one-component 
model is given by 



(2.5) 



E 



v|| 




V 





0/ 



+ <J 2 l 



Its largest eigenvalue is ||v|| 2 + 
all other eigenvalues equal a 2 . 



a 2 with corresponding eigenvector ei, and 



2.3. Results for finite p,n. To study the relationship between the sample 
covariance matrix and the population matrix we introduce the following 
quantities. Let {u^}™ =1 and {£^}™ =1 denote the realizations of the r.v. u 
and of the noise vector £ in the given dataset {x"}£ =1 . Let vpca denote the 
eigenvector of the sample covariance matrix S n with largest eigenvalue Apca • 
Our goal is to find the relation between the finite sample values (Apca, vpca) 
and the limiting values (||v|| 2 + cr 2 ,ei). Since with an exponentially small 
but nonzero probability Apca may be arbitrarily small and (vpcA> e i) may 
be arbitrarily close to zero, we construct highly probable bounds on these 
quantities, for example, bounds that hold with probability at least 1 — e, 
where hopefully £<1. This is common practice both in machine learning 
and in concentration of measure results. 
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As we shall see below, the following quantities come into play in the 
analysis: 

1 " 

(2.6) S l = -J2(un 2 , K=||v||a„, 

, n 1 n 

(2.7) Pi = — E u "^ % -e##- 

Loosely speaking, s u is the second moment of the variable u, which is close 
to unity for large n, k is the "signal strength" for the given dataset, the 
random variables pj capture the signal-noise interactions, and fiij are pure 
noise terms. 

Instead of considering asymptotic results as n — > oo or as p, n — > oo, in our 
analysis we keep p, n fixed but view the noise level a as a small parameter. 
Further, for some of our analysis, we even keep the realizations {u u } of 
the random variable u fixed. The following theorem provides probabilistic 
approximation results in terms of these quantities. 



(2. 



Theorem 2.1. Let s%, S2, S3 > and define 

P 



e = exp 



£2 = Pr 



2(V5 + 2)< 



(29) - --i P -i 

Assume that n <p and that 



X 2 p -i 



1 



> 



Vp 



ei = Pr{|iV(0,l)| >*l}, 
2 j}, £3=Pr{xi>S3}. 



2a si k 



> a\l + y/(p-l)/nf + (p- l)/n) 



(2.10) 



then with probability at least 1 — E\ — £2 — £3 — £, 



Apca > « 



(2.11) 



and 



(2.12) 



1 _ 2aSl + °" 2 p ~ 1 1 ~ ~ 1 

Ky/n k 2 n l + 2asi/Ky/n 

,;Vp-i\ 2 (i + VvFT) 2, 



k 4 V n / (1 — 2a s\ / K^/n)^ 



2as\ a S3 



Apca < K z 1 + — = + 



+ a k 



p — 1 / 2crsi <t 2 S3 



n 



l + ^ + ^(l + S2 /v^I). 
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As for the leading eigenvector, with at least the same probability 



sin0 PCA <^/^(l + 2^Vl + 52 



k\ n V K\/nJ\ Jv — 1 

(2.13) 



+ 4v^ 



2. 1 



K 2 n 1 — (2cisi/(K- v /n)) — a 2 / 'k 2 ' 
w/tere sin0 PC A = V 1 ~ ( v PCA,ei) 2 . 

Remarks. Equation (2.10) can be interpreted as a condition that the 
signal strength is larger than the noise, since the right-hand side is a prob- 
abilistic upper bound on the norm of a noisy Wishart matrix which holds 
with probability at least 1 — e; see (3.3) below. The bounds (2.11), (2.12) and 
(2.13) seem complicated as they involve large deviation bounds on various 
random variables appearing in approximations for Apca an d vpca- When 
both p, n are large, s u is close to unity, so k ~ ||v||. Assuming condition 
(2.10) holds, and neglecting terms 0(1/ y/n) and 0(1/ \/p), gives that 



V +a 71 — TTTT <A PC A<||v|| +Cl||v| L 



n v r \ n J V n 



and sin^pcA ^5 (°7ll v ll)\/p/ n + 0(<r ). Corollary 1 below shows that the 
upper bound on sin#pcA is "sharp," in the sense that (a/\\v\\)^p/n is the 
value expected on average. Similarly, the first two terms in the lower bound 
for Apca are sharp as well. 

Theorem 2.2. For fixed p,n and fixed realizations {u u }™ = i and {^}™ =lJ 
the largest eigenvalue and eigenvector of sample PC A are analytic functions 
of a inside a small interval near the origin. Inside this interval, the Taylor 
expansion of Apca *s given by 

(2.14) Apca = ^ 2 [l + 2%i + ^ (E Pj + ^ + °( ct3 )) 

whereas the Taylor expansion of the corresponding eigenvector vpca is, up 
to a normalization constant, 

(2.15) vpca = ei + -(0, p 2 , . . . , p p ) + 0(a 2 ). 

Equation (2.14) shows that when p^>n, even for relatively small a, the 
0(a 2 ) term may be larger than the 0(a) term, since pi/n = Op(l/(^/n\\v\\)) 
whereas 1/k 2 p] = Op(p/(n\\v\\ 2 )). 
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Corollary 1. For fixed p,n, the mean and variance of the top eigen- 
value as a — > are given by 

(2.16) E{A PC a} = ||v|| 2 + a 2 (l + + 0(a 4 ) + t.s.t. 
and 

(2.17) V ar{ A P0A } _ ||v||^ + f^WEM + Q(al) + t s± 

n n 

Here t.s.t. denotes transcendentally small terms in a. These terms arise 
from the small probability of a crossover between the eigenvalue due to 
the signal and the largest eigenvalue due to the noise, and are of the form 
A(n,p, cr 2 )e _c '( n ' p 'll v ^/ <T , where A(n,p, a 2 ) is analytic in a. 

Note that if u ~ N(0, 1), then we obtain Var{A PC A} = 2/n[\\ v|| 4 + 2a 2 || v|| 2 ] + 
0(o ), which recovers the asymptotic in n result of Girshick for the variance 
of the largest eigenvalue [12]. 

Corollary 2. For fixed p,n and fixed realizations {u v }™ =1) the mean 
o/sin6>pcA> as a— >0, is given by 

The r functions arise from the average of the square root of a chi-squared 
variable. If p 3> 1, then 

E{sin 9pcs) « ^ vs(i - jg^iy + o(£) ) + 0(O + t.s.t. 

and 

2 

Var{sin0p C A} ~ ^-(1 + 0(l/p)) + 0(^ 3 ) + t.s.t. 

2.4. Results for the joint limit p, n —> oo. A second approach to studying 
PCA is to consider the joint limit p, n — > oo with p/n = c; see [3, 4, 9, 14, 29] 
and references therein. A matrix perturbation approach can also be used to 
prove results regarding the largest eigenvalue and corresponding eigenvector 
in the joint limit p, n — > oo. Specifically, we obtain the following theorem. 

Theorem 2.3. Consider the spiked covariance model (2.4) with a single 
component, and assume that the random variable u has finite fourth moment. 
Then, in the joint limit p, n — > oo, 



(2.19) Apca 



^(l + \/-) 2 , */n/p<a 4 /||v|| 4 , 



+ - 2 ) 



1 + ^ 



n llvll 2 



if n/p > cr 4 /||v" 4 
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Similarly, the dot product between the population eigenvector and the 
eigenvector computed by PC A also undergoes a phase transition, 

i? 2 (p/n) = |(v PCA ,v)| 2 

(2.20) 

0, ifn/p<o-y\\v\\ 4 , 
(n\\v\\ 4 )/(pa 4 )-l 



if n/p > cr 4 /||v" 



(n||v|| 4 )/( PC T 4 ) + (||v|| 2 )/ ( 7 2 ' 



Equation (2.20) shows that to "learn" the direction of true largest vari- 
ance, even approximately, the ratio n/p must be larger than a critical thresh- 
old. This is named in the literature as retarded learning, or as the phase tran- 
sition phenomenon. In the statistical physics community these results were 
derived using the replica method [5, 37]. In the statistics literature (2.19) was 
proven by Baik and Silverstein, using the Stieltjes transform [4], for the more 
general case of a spiked covariance model with an arbitrary finite number of 
independent and not necessarily Gaussian components. They showed that 
(with (7=1) all eigenvalues ay > 1 + \fpjn are shifted to ay + cctj/ (ay — 1) , 
and stated that "it would be interesting to have a simple heuristic argument" 
which shows how to obtain these pulled up values. In this section we present 
a matrix perturbation view of this problem, including a simple derivation 
of the pulled up values and some discussion as to the phenomenon for finite 
p,n. A similar approach was recently independently derived by Paul [31]. 

Equation (2.19) shows that for a spiked covariance model with a single 
component and with a = 1, in the joint limit p, n — ► oo, a large enough signal 
eigenvalue a is shifted to 

(2.21) a + - 



na — 1 

We now present an interesting connection between this formula and a classic 
result by Lawley from 1956. In [23], Lawley considered the eigenvalues of 
PCA for multivariate Gaussian observations whose limiting covariance ma- 
trix is diagonal with eigenvalues a\, . . . , a p . Denote by l\, .. .,£ p the noisy 
eigenvalues of PCA corresponding to a sample covariance matrix with a 
finite number of samples n. Then, as n — > oo, with p fixed 

(2.22) E {4} = « fe + ^ £ -^— + o(^\ 

Applying Lawley's result to the spiked covariance model with a single sig- 
nificant component {a.\ = a, 02 = • • • = a p = 1) gives 

Apca = a H + O ^ 

n a — 1 \n z 
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whose first two terms recover the asymptotic result (2.21) of the joint limit 
p, n — > oo. The remarkable point in using (2.22) is that it does not use ex- 
plicit knowledge of the Marchenko-Pastur distribution, regarding the lim- 
iting density of eigenvalues of infinitely large random matrices. We remark 
that although the first two terms in Lawley's expansion provide the asymp- 
totic result as p,n — > oo, this is not due to the higher-order terms all van- 
ishing individually. Rather, in the joint limit p, n — > oo they all miraculously 
cancel each other. Yet, based on this observation, we propose the following 
two theorems. 



Theorem 2.4. Consider a more general Gaussian spiked covariance 
model with k large components with fixed variances ai,...,(Xk an d with the 
p — k remaining components each having a random variance sampled inde- 
pendently from a density h{a) with compact support in the interval [0, a c ] 
(h(a) = for a > a c ). Then, in the joint limit p, n — > oo and for large enough 
values OLj , the first k largest limiting eigenvalues of PC A converge to 

(2.23) Aj = T(aj) = a,- + —aj / — - — h(p) dp. 

n Jo a.j — p 

Theorem 2.5. Let {aj} denote an infinite sequence of i.i.d. random 
variables from a density h(a) with compact support. Let x = (x\, . . . ,x p ) be 
a p- dimensional vector composed of p independent Gaussian random vari- 
ables, where each Xj has variance aj. Let C(n,p) denote the px p empirical 
covariance matrix computed from n independent samples {xj}™ =1 from this 
model. Then, in the joint limit p,n^> oo, p/n = c, the spectral norm of C is 
equal to 

(2.24) lim \\C\\ = a* + ca* [ 9 h(p)dp 
p,rwoo J a * - p 



where a* is the maximal point at which 
(2.25) dT{a) 



da 



0. 



A motivation for the model considered in these theorems is a setting where 
high-dimensional observations are of the type "signal plus noise," but where 
the noise is heteroscedastic and possibly correlated. Thus, in a suitable basis, 
different noise components have different variances, and we only know some 
statistical properties about the noise, such as the distribution of these vari- 
ances. Theorem 2.4 is a generalization of (2.21) that holds for the standard 
spiked covariance model. Theorem 2.5 follows immediately from Theorem 
2.4 according to the following reasoning: In this modified model there is 
also a similar phase transition phenomenon. If the original variance of the 
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signal, a, is larger than a critical value a* (h,p,n), then it will be pulled up 
from the noise in the limit p,n — > oo. Further, for all a > a* , this pulled up 
value is monotonic in a. From Theorem 2.4, the critical value a* satisfies 
(2.25), and at that point according to our matrix perturbation analysis, the 
value T{a*) is equal to the spectral norm C — the covariance matrix of the 
noise. We remark that a formula similar to (2.24) was recently derived by El 
Karoui [11], who also studied the finite p,n fluctuations around this mean. 

Corollary 3. Consider the general spiked covariance model as in The- 
orem 2.4, and assume c = p/n^> 1. Let 

H\ = J ph{p) dp, $ = J(p- pifh{p) dp. 

Then, in the joint limit p, n — > 00 , the norm of a pure noise matrix is ap- 
proximately 



(2.26) A(a*) = /ii(c + 2v^/l + ^! + 0(l)). 



Further, the phase transition phenomenon for the pulled up eigenvalues oc- 
curs at 



(2.27) a* = Ml f vW 1 + 7§ + 1 + 4 2 + 0(1/ y/c) 



p\ 1 + 2$/ 

Equation (2.26) may be useful for inference on the number of components 
in a general spiked covariance model, given that the first two moments pi, P2 
of the density h(p) of the noise are either known a priori or estimated from 
the data. 

3. Proof of Theorem 2.1. To prove Theorem 2.1 we shall use the follow- 
ing three lemmas: 

Lemma 1. Let A,B be p x p Hermitian matrices. Let {Aj}^ =1 denote 
the eigenvalues of A sorted in decreasing order with corresponding eigen- 
vectors Vj. Let Pi denote the projection into the orthogonal subspace o/vj, 
PjV = v — (v, Vj)vj. If X\ has multiplicity 1 and \\B\\ < X\ + (vx, -Bvi) — A2, 
then the largest eigenvalue of A + B satisfies the bounds 

(3.1) (vi.Bvi) < Xi(A + B) - Ai < (vi.Bvi) + ||i\Svi||. 

Lemma 2. Let X denote an n x p matrix with entries Xu all i.i.d. 
N(0, 1) Gaussian variables, and let W = X T X/n be the corresponding scaled 
Wishart matrix. Define 
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W\\ < (1 + \ p/n) +p/n 



then for n<p with probability at least 1 — e, 

(3.3) 
and 
(3.4) 



|W — 7p|| <4 



P 



n 



Lemma 3. Let A be a p x p Hermitian matrix and let B be a Hermitian 
perturbation. Let (A,v) be the eigenvalue/vector pair ofA + B correspond- 
ing to (Aj,Vj) of A, and let 5 = min^ | A — Xj \ , where {\j}? =1 are all the 
eigenvalues of A; then 

11-611 

(3.5) sin0(v, Vi )<U. 

o 

Lemma 1 follows from classical results in matrix or operator perturbation 
theory. According to [30], Theorem 4.5.1 (see also [13], Theorem 6.3.14), for 
each scalar u, vector x and Hermitian matrix A, there exists an eigenvalue 
A of A such that 

|A — /i| < \\Ax — ux\\/\\x\\. 

Applying this theorem to the matrix A + B, with x = Vj a normalized eigen- 
vector of A and with fi = Aj + (vj,Bvj), gives that 

\X(A + B) - \i(A) - (vi, J3vi)| < \\PiBvi\\. 

The condition of Lemma 1 ensures that the largest eigenvalue of A is the 
one closest to the largest eigenvalue of A + B, for example, % = 1. For the 
analysis of a spiked covariance model with more than one component, sim- 
ilar statements can be made for interior eigenvalues as well [16]. Lemma 2 
follows from Theorem 11.13 of Szarek and Davidson [7] and is proved in the 
Appendix. Lemma 3 is known as the sin# theorem of Davis and Kahan [8]; 
see also [30], Theorem 11.7.1. 



Proof of Theorem 2.1. Let {x^}™ =1 be n i.i.d. observations from 
the one-component model (2.4). We decompose the corresponding sample 
covariance matrix as follows: 
















+ OK 



(2pi 

P2 



P2 





Pp\ 





V 



0/ 



V Pp o 



0/ 
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(3.6) 



(P. 



1.1 



2.1 



01,2 
/?2,2 



aC\ + a 2 C 2 , 



0i,p\ 



where k, pj and /3y are denned above in (2.6) and (2.7). Note that conditional 
on the realizations {u v Yv=\ of the random variable u kept fixed, pi = ^?7i, 
where r]i are all i.i.d. N(0, 1), = Xn/ n an d that £2 is a Wishart noise 
matrix. The matrix £0 can be thought of as the "signal," the matrix C\ as 
the signal-noise interactions, whereas £2 contains pure noise. 

To derive the lower bound (2.11), we view the matrix cr 2 £2 as a pertur- 
bation of the matrix Cq + aC\. Since the matrix £2 is nonnegative (it is a 
scaled Wishart matrix), it follows that 

Apca > ||£o + o"£i||- 
The matrix Cq + aC\ has rank 2 with the following two nonzero eigenvalues: 

(k 2 + 2an Pl ) ± J{k 2 + 2o-k Pi ) 2 + A{uk) 2 J2j> 2 p) 



(3.7) 



where A+ is positive and A_ is negative. Since conditional on the realizations 
u u fixed, the random variables pj are independent Gaussians, we define 



(3.8) 



Hp) 

i>2 



Ti 

n 



where T\ 

Using the inequality \f\ + x > (1 + x/2 — x 2 /8) in (3.7) gives 

(™) 4 



A+ > (k + 2anpi) 



1 + 



a 2 K 2 



T 2 



(k 2 + 2anpi) 2 n 
Therefore, with probability at least 1 — s\ — e 2 



(k 2 + 2(JKpi) A V? 



Apca>A+>k z 1 



2(TS\ 



Ki/n 



1 + 



2 



1 i-s 2 /Vp~=T 



n 1 + (2a s{) / (ny/n) 



ac 4 n 2 (1 - (2cjsi)/(k^)) 3 

which proves the lower bound (2.11). 

To prove the upper bound on Apca> we use Lemma 1 with the matrix 
aC\ + a 2 C 2 as a perturbation of £q. Choosing ei as an eigenvector of £q 
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and applying the lemma gives that 



Apca < k 2 + 2attpi + a 2 Pu + a j^i^pj + <rPij) 2 . 

Vi> 2 

Conditional on fixed realizations u v and on fixed noise realizations in the 
first component of the data, we have that 

1 



Kpj + afiij = —=Jk 2 + 2<t np\ + a 2 Pnr]j, 



where the rjj are independent standard Gaussian variables. Therefore, 



Apca < (« + 2cjk/9i + a /?n) + cry k 2 + 2<tk/9i + a 2 [3ii\\ — , 

where the random variable T~Xp-i) an d is independent of pi and /?n. 
Therefore, with probability at least 1 — e — £\ — £2 — £3 the bound of (2.12) 
follows. 

To prove a bound on the eigenvector vpca> we start from the eigenvector 
v + corresponding to A+, and given by 

1 



(3.9) 



UK T 

ei + T— {0,p 2 ,...,Pp) 

where Z = y'l + a 2 K 2 T\/n)^_ is a normalization constant such that ||v.| 
Simple algebraic manipulations and the triangle inequality give 



sin6»p C A = V 1 - (vpcA,ei) 2 = \J 1 + |(v PC A,ei)|yT - |(v PC A,ei) 

'|vpcA - ei|| 



(3.10) = v /l + |(vp C A,e 1 )| 

< || vpca — ei || < || vpca - v+ 1| + || v+ - ei I 
From (3.9), a bound on the second term in (3.10) is 



(3.11) || v+ _ ei || = ^l-^<^. 

For the first term in (3.10), applying the sin# theorem (Lemma 3 above) with 
the matrix a 2 (W — I p ) = a 2 (C2 — I p ) as the perturbation of Co + aL\ + cr 2 I p 
gives 



||vp C A - v+|| = V2\/l - |(v PC A, v+)| < v / 2sin^(vpcA,v + ) 

(3.12) 



FINITE SAMPLE APPROXIMATION RESULTS 



15 



where S = min^i | Apca — Aj(£o + c£i + <j2 -^p)I ■ Therefore, combining (3.11) 
and (3.12), 

(3.13) sin^p C A<^W — + a 2 V2 U 



A_|_ y n 5 

To conclude the proof we apply almost sure bounds for \\W — I p \\ and T\ 
from above and 5 and A+ from below. Bounds on T\ and A+ are identical to 
those described in the proof of (2.11) of the theorem. To bound \\W — I p \\ 
from above, we use (3.4) of Lemma 2, which states that with probability at 
least 1 — e 

(3.14) ||W-/J|<4-. 

n 

As for a bound on 5, from the first part of the proof, if k 2 + 2aKp\ > c 2 [(l + 
y/p/n) 2 +p/n], 

Apca > A + > k 2 + 1ok>p\. 

Furthermore, the eigenvalues of the matrix Co + C\ + a 2 1 are (A + ) + a 2 , a 2 
or (A_) + <r 2 , with A-t given by (3.7). Therefore, 

5 = min |Apca — A, (£o + C\+ a 2 I) \ = Apca — cr 2 > k 2 + 2anpi — a 2 
and with probability at least 1 — ex , 

(3.15) 5>« 2 -2 Sl ^-a 2 . 

Combining (3.13), (3.14) and (3.15) concludes the proof. □ 

4. Proof of Theorem 2.2. We now explore the leading order terms in a 
of the explicit dependence of Apca and vpca on p,n, and on the specific 
signal and noise realizations {n v }™ =1 and {^ 1/ }'™ =1 . To this end, we view a 
as a small parameter and consider the Taylor expansion of Apca and vpca 
as <t — > 0. By definition, the largest eigenvalue Apca is the largest root of 
the characteristic polynomial of sample covariance matrix S n . For a = 
this eigenvalue is a simple root with multiplicity 1. Therefore, given a finite 
dataset {x 1/ }'™ =1 with s u > 0, the largest eigenvalue A(cr), when viewed as 
a function of noise level, is an analytic function of a in the complex plane 
for small enough a. This statement follows from the representation of the 
empirical covariance matrix as S n = Co + <?C\ + a 2 C2, (3.6), with all matri- 
ces being symmetric, together with standard results regarding perturbation 
theory for linear operators; see, for example, Kato [21], Chapter 2, Theo- 
rem 6.1. Moreover, the radius of convergence of a Taylor series of A(<r) is 
the largest complex a for which Apca(o') > A2 (cr) where A2(<r) is the second 
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largest eigenvalue of the noisy covariance matrix. Note that the location of 
the crossover depends on the specific signal and noise realizations. Finally, 
since the matrix S n is symmetric A(<r) is real when a is real. 

Therefore, for small enough a we can expand both the top eigenvalue and 
its corresponding eigenvector as a regular power series in a: 

vpca = v + avi + a 2 v 2 H , 

Apca = Ao + crAi + a 2 X 2 + • • • • 

We insert these expansions into (3.6) and equate terms with equal powers 
of a. The first few equations read 

£o v o = Aovo, 

£ vi + C±v = A vi + Aiv , 

C v 2 + Avi + C 2 v = A v 2 + A1V1 + A 2 v . 

Iteratively solving these equations gives 



(4.1) 
and 



A = k 2 + 2<jk Pi +<J 2 (j2p 2 j+ 011 ) + °( cj3 ) 

Vj>2 / 



a . 
v = ei + -(0,p 2 ,---,P P ) 

(4-2) 

+ a — [(0, /3i2, • • • , Pip) ~ 2pi(0, P2, ■ ■ ■ , P P )} + O ( ^ 

Note that up to order 0(a 2 ), the eigenvalue Apca and the corresponding 
eigenvector depend only on the first row of the noisy matrix, for example, 
only on the interaction between signal and noise. 

Equations (2.16), (2.17) and (2.18) follow by taking expectations on ex- 
pressions (4.1) and (4.2) for Apca and vpca, respectively, and retaining only 
the leading terms in a. However, an important remark is that (4.1) and (4.2) 
are the Taylor expansions of the eigenvalue and eigenvector that are ana- 
lytic in a and correspond to k 2 and ei when a = 0. As such, these are not 
necessarily expansions of Apca and vpca — the actual largest eigenvalue and 
corresponding eigenvector of the sample covariance matrix. This is because 
for finite a > there is a nonzero probability that the largest eigenvalue is 
one due to noise and not the one described by (4.1). From Lemma 2, the 
probability of such a crossover, between the eigenvalue due to noise and 
the eigenvalue due to the signal, can be bounded by an expression of the 
form A(n,p) exp(— C(n,p)/a 2 ). Therefore, by taking expectations of (4.1), 
we introduce trans cendentally small error terms in a. 
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5. Proof of Theorem 2.3: the phase transition phenomenon. 



5.1. A simple heuristic for the location of the phase transition. First, we 
present a simple heuristic explanation for the phase transition phenomenon, 
but for fixed p,n, as a function of noise level a. Obviously, for fixed p,n 
there is no deterministic phase transition at a fixed constant c = p/n, only 
an increasing probability for losing the relation between the direction of 
maximal variance and the limiting vector ei. From the analysis of Section 
3, this occurs when the largest eigenvalue of the sample covariance matrix is 
of the same order of magnitude as that of the noise matrix E, Apca ~ ll-^ll- 
From (3.4), this occurs roughly when 



k + a 



n 



This gives 

p Ik 4 
n ~ To 1 ' 

which up to a multiplicative constant has the same functional dependence 
on the parameters p,n,o~ as the true location for the phase transition in 
(2.19) and (2.20). 

5.2. An exact analysis of the phase transition. We now present a sim- 
ple linear-algebra based derivation of the exact pulled up value for a spiked 
population model with a single component. A similar though more compli- 
cated analysis applies for the general /c-component model. For simplicity, we 
perform our analysis for p/n = c = 1, and without loss of generality assume 
a = l. 

To this end, we decompose the p x p sample covariance matrix computed 
from n samples as follows: 



Sri 



(k 2 + 2k Pii +/3 n b 2 
b 2 



V b p 







0/ 



+ 



<l 



Vo 





022 

032 
Pp2 



P2p 



fop 
fipp 



where bj = Kpj + Note that the second matrix, which is the minor of the 
covariance matrix obtained by deleting the first row and column, is just a 
(p — 1) x (p — 1) scaled Wishart matrix. Let X 2 , ■ ■ ■ ,X P denote its eigenvalues 
and let V pxp be the matrix of its eigenvectors padded with zeros in the 
first coordinate. We perform a change of basis whereby this matrix becomes 
diagonal. Then the full covariance matrix takes the following form: 

/K 2 + 2 K/ oi+/? n b 2 ■■■ b p \ 
b 2 \ 2 ■■■ 



(5.1) 



VS n V T 
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where b\j are the entries of the first row and column in the new basis. The 
specific form (5.1) is known as an arrowhead matrix [28]. Assuming that 
bj 7^ for all j (an event with probability 1), the eigenvalues of this matrix 
satisfy the following secular equation: 

(5.2) /(A) = (A - k 2 - 2k Pi - M - £ -5- = o. 

3=2 A " X 'J 

Recall that bj = p,- + (3±j are the entries of the first row and column in the 
new basis. They are given explicitly as 

n 1 n 

Pj = — Yj u Wv fa = - E iiih 

ns v ^— J J J n ^— ' J 

» v=l u=x 

where £ u are the noise vectors in the rotated basis in which the (p — 1) x (p — 1) 
submatrix of noise covariances is diagonal with eigenvalues Aj. Therefore, 
conditional on £j having variance Xj, the quantity bj is iV(0, Xj{n 2 + 2npi + 
/?n)/n). Therefore, 

where rjj are all i.i.d. N(0, 1) and independent of Aj. Furthermore, in the 
limit p,n — > oo, p/n = c, the distribution of eigenvalues Xj of the submatrix 
converges to the Marcenko-Pastur distribution [24], 

fMp(x) = l^r c \J ( b ~x)(x- a), xe[a,b], 

where a = (1 — ^/c) 2 ,b = (1 + \/c) 2 . In addition, as n,p — ► oo, k 2 — > ||v|| 2 , 
pi = Op(l/y/n) — ► and /3n = Xn/ n ~~ * 1' an with probability 1. Therefore, 
as p,n— > oo, the sum in (5.2) converges with probability 1 to the following 
integral: 

(5.3) lim £^L = (||v|| 2 + l)^ f b f Mp{x) * dx . 
p,n^oo f— ■i A — A, n J a X — x 

This integral is a linear functional of the Marcenko-Pastur distribution, 
with some similarity to its Stieltjes transform. We remark that the Stieltjes 
transform was used extensively in deriving results regarding the eigenvalues 
of random matrices [24, 33]. 

This integral can be evaluated explicitly. For example, for c = 1, 

(5.4) f 1-^/(4- X ) x -J— dx = l -[X-2- yWA-4)]. 
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Thus, the largest eigenvalue satisfies a quadratic equation in A, whose solu- 
tion is 

. , . a 

Afaj = a + c - 

a — 1 

with a = ||v|| 2 + 1. However, this solution is indeed the largest eigenvalue 
only if A(a) > (1 + \/c) 2 . This recovers the pulled up value and the exact 
location of the phase transition, (2.19). 

To prove (2.20) for the eigenvector overlap, we note that eigenvectors of 
arrowhead matrices have also a closed form expression [28]. Let A be an 
eigenvalue of the arrowhead matrix (5.1); then up to normalization, the 
corresponding eigenvector is given by 



(5.5) 

Therefore, 



bp 



A — A2 A — X p 



R 2^{v,eiy_ 1 



HI 2 i + E,> 2 ^/(A-A^ 

In the joint limit p,n^ 00, similar to the analysis above, the sum in the 
denominator converges with probability 1 to an analogous integral 

(5.6) lim R 2 



p,n^oo,p/n=c 1+p/n j Q/i/(A — ^) 2 ./mP (aO d/J, 

Evaluation of the integral gives the asymptotic overlap, (2.20). 

5.3. A classical result of Lawley and two theorems. While Theorems 
2.4 and 2.5 are motivated by the classical result of Lawley, (2.22), their 
proof relies on results from random matrix theory regarding the limiting 
empirical density of eigenvalues of sample covariance matrices in the joint 
limit p,n— > 00. Before proving these theorems, we first briefly review the 
results required for our proofs. 

The key required quantity is the Stieltjes transform of a probability den- 
sity h(t) defined as 

(5.7) m h {z)=j^-dt VzeC+. 



t 

Let S n = l/nZ T Z be the pxp sample covariance matrix of n observations 
Zj G W from the model described in Theorem 2.5, and denote by F n the 
empirical distribution function of the eigenvalues of S n , 

(5.8) F n (t) = { *^±. 

P 
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As proven in [33] , in the limit p, n — > oo, F n converges with probability 1 to a 
limiting distribution F with no eigenvalues of S n outside the support of this 
limiting distribution. Although the explicit form of F can be computed only 
in a handful of simple cases, its Stieltjes transform satisfies the equation 

h(t) 



(5.9) 



m{z) 



<(1 



m(z)z) 



■dt. 



One can also consider a different matrix, S n = l/nZZ T of size n x n. Since 
the matrices S n and S n have the same nonzero eigenvalues and differ by 
\p — n\ zero eigenvalues, their respective empirical distribution functions F 
and F are related as follows: 

F=(l-c)/ [0iOo) + cF. 

Due to linearity of the Stieltjes transform, 

1 -c 



(5.10) 



m{z) 



+ cm(z). 



Obviously, when c = 1, m{z) = m(z). 

The last result of interest is an inverse equation for m{z), which reads 



(5.11) 



zlm) 



1 

-— + c 

m 



t 



1 + tm 



h(t) dt. 



Proof of Theorem 2.4. For simplicity, we consider a spiked covari- 
ance model with a single spike, assumed in the direction ei . Let ai be fixed 
and sufficiently large, and let {oy}j>2 denote a sequence of i.i.d. realizations 
sampled from a density h(a). Consider n i.i.d. random vectors {x l/ }" =1 from 
a model 

v 

x = ^/olyiei + ^2 V®]yj e ji 

J =2 

where yj are all i.i.d. Gaussian N(0, 1) random variables. We view the direc- 
tion ei as the signal direction with all other directions as noise, and write 
the corresponding sample covariance matrix as follows: 

( k 2 h ■■■ b p \ 



Sri 



Cm 



where 
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and C n is the (p — 1) x (p — 1) sample covariance matrix of the pure noise 
components. 

Let Vo be a (p — 1) x (p — 1) matrix that diagonalizes the pure noise 
matrix C n , and consider the p x p unitary matrix 

1 



In the basis defined by V, the sample covariance matrix takes the form 

/if bi b 2 ••• 6 P \ 
bi fi\ 

h fJ>2 



(5.12) 



vs n v- x 



\b p \i p l 

where bj is the projection of the vector b on the jth basis vector of the 
matrix V, 

n 



'J 



b 1 Vi 



1 



V - v 



As in (5.1), the matrix (5.12) has the form of a symmetric arrowhead matrix 
and assuming all bj / 0, its eigenvalues are solutions of 

v 

(5.13) \- K 2 = Y^ 



We now consider the joint limit p, n — > oo, p/n = c. Similarly to the analysis 
of Section 5.2, the sum in (5.13) converges with probability 1 to 



(5.14) 



A — fi 



dF(n), 



where F{n) is the limiting probability distribution of a pure noise random 
matrix corresponding to the density h(a). Therefore, the pulled up value is 
the solution of 



(5.15) 



A — OC\ = COL\ 



■ dF(fi). 



A — /x 

To finish the proof, we use Lemma 4 below, which shows that for any value 
of c and density h, m(A(a)) = — 1/a, and then insert this relation into the 
inverse equation (5.11). □ 

Lemma 4. Let A(a) denote the pulled up value corresponding to an orig- 
inal eigenvalue a. For a large enough, regardless of the constant c and of 
the underlying density h(t), 

1 

a 



(5.16) 



m(A(a)) 
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Proof. We rewrite (5.15) as follows: 

(5.17) A(q) = a(l - c) + Aac / —— dF(p). 

By extending the definition of the Stieltjes transform m(z) of F, originally 
defined only for z £ C + , also tozGl with z > support(F), the last equation 
reads 

(5.18) X(a) = a(l — c) — \acm(\(a)) 
or stated otherwise 

(5.19) cm ( A ( a ))-I^ = -I, 

X{a) a 

but according to (5.10), the left-hand side is simply fh(X(a)), also extended 
to the case of inputs zGl with z > support(F). □ 

Proof of Theorem 2.5. We consider the relation a(A). That is, for 
each A > support(i ? ) we look for the corresponding a such that (2.23) is sat- 
isfied. We show that there exists a unique solution a(A), which is monotonic 
in A, and as A — ► support(F), satisfies that a(X) — ► a* < oo, but da/dX — > oo. 

To this end, we rewrite (5.17) as follows: 

1 

(5 ' 20) a ~ {l-c)/X + cjl/{X-ii)dF{iiy 

Since for A > support(F) the functions 1/A and J 1/(A — n) dF(fi) are both 
strictly monotonically decreasing in A, it follows that 

dcx 

(5.21) — — > for A > support(F). 
dX 

We now analyze the behavior of both a(A) and its derivative as A approaches 
the support of F. According to [9, 34] near the boundary b = support(F), 
the density F exhibits a behavior closely resembling y/\b — x\ . Therefore, 

(5.22) lim / —— dF(fi) = Const, 

A^support(F) J X — fx 

whereas 

d f 1 

(5.23) lim — — — / — dF(fi) = oo. 

A-+support(F) dX J X — H 

This proves Theorem 2.5. □ 

Proof of Corollary 3. We start our analysis from the equation 

(5.24) X(a) = ail - c) + a 2 c [ dt. 

J a — t 
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First we make a change of variables t = Hi + s, where 



(5.25) 



Mi 



th(t) dt 



and denote Hq(s) = h(t), with the subscript zero denoting the fact that this 
density has zero mean. Then, 



(5.26) 



A(a) = a(l — c) + 



a 2 c 



ho(s) 



ds. 



a — fi\ J 1 — s/(a — hi) 

As c — ► oo, both a — > oo and A — ► oo. Specifically, a — Hi > support(F). In 
this case we can expand 

! .f(^-)\ 

l-sf(a-Hi) ^ \a-ViJ 

where the sum is convergent for |s| < support(i ? ). Inserting this expansion 
into (5.24) gives 



(5.27) A(a) = a(l - c) + 



a 2 c 



a — Hi 



1 + 



■s 2 ^o(s) 
(a - hi) 2 



ds + O 



1 



a — Hi 



Taking only the first two terms in the expansion gives the approximate 
solution 

q* = hiO- + s/c)- 

To obtain the correction due to the variance of the distribution, we expand 

L 



(5.28) a* = h[PiV^ + Po + -^ + --- 



and insert the expansion into (5.27). This gives 



(5.29) 



a 



fi 



/'2 



V Mi 1 + 2M2/M1 



01 ^5 



for the location of the phase transition, and 
(5.30) 



A(a*) = h 



c + 2^1 + ^1 + 0(1) 



□ 



5.4. T/ie phase transition phenomenon for finite p,n. We conclude by 
presenting two examples of the phase transition phenomenon for the stan- 
dard spiked covariance model with finite p,n. In Figure 1 we present an 
example of this phenomenon with n = 50, p = 200, k = 2.8. For small noise 
level a, the largest eigenvalue is roughly k 2 = 7.87, and (vpca^i) w 1. As 
the noise level a increases the dot product decreases smoothly. However, at a 
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noise level of roughly a = 1.85 the largest eigenvalue and the second largest 
eigenvalue "cross" each other, leading to a sharp decrease in the overlap be- 
tween the first eigenvector of PCA and the correct direction ei. The reason 
is that at this level of noise the spectral norm of the noise matrix overcomes 
the eigenvalue corresponding to the original signal. Thus, after the crossing 
the corresponding leading eigenvector is due to noise and points to a random 
direction. 

As a second example, we present a possible behavior of R = | (vpca, ei) | as 
a function of the number of samples n. In Figure 2 we present a simulation 
result with p = 600, a = 1, ||v|| = 2 and a variable number of samples n. 
When n is very small the signal information is too weak, and as expected 
the overlap R is very small. Then, as n increases the largest eigenvalue 
of the noisy covariance matrix corresponds to the correct one and there 
is an increase in R. However, in this example at n = 77 there is a short 
crossover between the dominant eigenvalue and one due to noise (e.g., A2 ~ 
Ai), leading first to a sharp decrease in R and then to a sharp increase in 
R back to around 0.5. The formulas derived in this paper can be used to 
estimate the probability of such a loss of locking to occur. This analysis also 
shows that care should be taken when applying bootstrap procedures for the 
eigenvectors, since in the case of weak signals in certain subsets of samples 
the resulting leading eigenvector might be due to noise. 

APPENDIX: PROBABILISTIC BOUNDS ON THE NORM OF 
WISHART MATRICES 

Let r denote an n x p matrix whose entries are all i.i.d. iV(0, 1) random 
variables. Consider the p x p scaled Wishart matrix W = (l/n)r T r and the 
matrix E = W — I p . In this Appendix we provide a probabilistic bound on 
the spectral norm of the matrices W and W — I p . 




Fig. 1. Loss of trucking of the correct direction as a function of noise level. 
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Fig. 2. Loss of tracking of the correct direction as a function of sample size n. Bottom 
hs are zoomed-in versions of the top ones. 



As shown in [18], in the limit n,p — > oo, p/n = c < 1, the largest eigen- 
value Xw of the matrix W is distributed according to 

1 1 \ X /3 1 



Xw = - 

n 



where Wi follows a Tracy- Widom distribution of order 1. Therefore, we seek 
a bound on the norm of E of the form \\E\\ < (1 + -y/c) 2 + c — 1 + const, that 
will hold with probability 1 — e. To prove such a bound (Lemma 2) we use 
the following proposition by Davidson and Szarek [7], which holds for finite 
p, n, with n<p. 

Proposition (Theorem 11.13 in [7]). Let T be an n x p matrix with 
n<p whose entries are all i.i.d. N(0, 1) Gaussian variables. Let s\(T) be 
the largest singular value ofT; then 

(A.l) Pr{si(r) + v^ + VP*} <exp(-pt 2 /2). 

Proof of Lemma 2. The largest eigenvalue of W = (l/n)r T r is given 
by Xw = [si(P)] 2 / n - Therefore, 



Pr{A^ > (1 + Jp~jnf + ap/n} = Pv{ Sl (T) > JpJ (1 + ^n~fpf + a} 
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We write 



with 



VpJ (1 + yn/p) 2 + « = v 7 ™ + VP + v 7 ^ 



t = a/ (1 + A/n/p) z + a - (1 + Jn/p) 



and use (A.l) to obtain that 



(A.2) 



Pt{\ w > (1 + V^/n) z + ap/n} 
< exp 



I' 



(Ja + (l + v /^) 2 -(l + v /^)) 2 



< exp 



n 



'a + (1 + v^Tp) 2 + 1 + Vn/P 
We specifically consider a = 1. Then, since n<p, 

(A.3) 



1 



i + (i + vWp) 2 + 1 + ^ + 2 



and 



(A.4) Pr{X w > (1 + V^) 2 +p/n} < expj- + — } 

Similarly, for n/p< 1, with probability at least 1 — e, 
(A.5) ||£|| 



W — 711 < (1 + A/p/n) 2 + - - 1 < 4~. 

v n n 



□ 



Remark. For n<p the matrix T T T always has p — n eigenvalues equal 
to 0, therefore E has p — n eigenvalues equal to —1. Thus, to bound \\E\\ we 
need only bounds on the largest positive eigenvalue as analyzed above. 
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